gestures. Hand gesture technology is therefore needed to help the public interact with people
with disabilities. This study proposes a reliable hand gesture recognition system using
the convolutional neural network method. The first step is pre-processing, which separates
the foreground from the background. The foreground is then transformed using the discrete wavelet
transform (DWT) to take the most significant subband. The last step is image classification with
a convolutional neural network. The training and test data comprise 400 and 100 images
respectively, covering five classes: A, B, C, #5, and pointing. The resulting hand gesture
recognition system achieved an accuracy of 100% for dataset A and 90% for dataset B.

Keywords: Hand gesture recognition

1. INTRODUCTION 

Technological growth has driven the development of human physical recognition, including hand
recognition for nonverbal communication. Signaling is a form of nonverbal communication frequently
used in sign language and symbols [1]. Gestures play a major role in communication because they are
used, often unconsciously, as supplemental hints for information that cannot be expressed
verbally [2]. Past research took two-dimensional hand gesture contours and converted them into
one-dimensional signals using reference values. Wavelet decomposition was then applied to these
one-dimensional signals. Four statistical properties of the wavelet coefficients were extracted,
and artificial neural networks (ANNs) were adopted to classify the gestures. That approach obtained
an accuracy of 97% with 240 images collected from 20 people [3]. Its weakness was that the system
failed to recognize gestures with nearly identical contours, which caused the classifier to predict
the wrong output. Moreover, blurry images made it more difficult for the classifier to predict
the gestures performed, so camera type and light intensity were very important in that research.
Using both hands in the experiment also reduced accuracy.

The objective of this research is to design a hand gesture recognition system with good
performance using the discrete wavelet transform (DWT) method and a convolutional neural network
(CNN) classifier. The benefit of DWT is that it provides time and frequency information
simultaneously, and wavelets can be adjusted and adapted easily [4, 5]. CNN shares the advantages
of deep neural networks (DNNs), as it can have more than one hidden layer between the input and
output [6-8]. In addition, it accepts two-dimensional arrays in the input layer, and its weights
can be shared [9]. Because a CNN performs feature extraction before classification, it can learn
additional features to distinguish similar gestures [10]. The problems addressed are how to design
and simulate a hand gesture recognition system for the datasets, what accuracy is obtained using
this method and classifier, and how the input parameters influence system performance.

Data acquisition was performed by taking the Sebastien Marcel Static Hand Posture Database as
Dataset A [11] and Erizka’s dataset as Dataset B [12]. This paper consists of four sections.
Section 1 is the introduction; Section 2 describes the research method, the basic formulation of
DWT and CNN, and the system performance parameters; Section 3 presents the performance of the hand
gesture recognition system; and Section 4 gives the conclusion.


2. RESEARCH METHOD 

A system has been designed to recognize hand gestures. The proposed system is illustrated in
Figure 1, and this general scheme is the basis of the study. The training process in Figure 1(a)
is as follows:
— Input: a training image database taken from a dataset with red, green, blue (RGB) layers.
— Pre-processing: resizing the image and segmenting the skin color of the input image. Images are
resized to 76x66 pixels for dataset A and 128x128 pixels for dataset B. Skin color is segmented by
thresholding the pixel values of the Cr component of YCbCr, followed by noise filling and closing.
Pre-processing generates a hand contour image.
— Feature extraction with DWT.
— Parameter training: forward and backward passes over the training images of each class using
CNN, with the feature vectors of the training images as input. The result of this process is
the updated weight values.
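
The skin-color segmentation step can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the full-range BT.601 (JFIF) definition of Cr, and the bounds `cr_min` and `cr_max` are hypothetical values, since the paper does not state its threshold. The noise filling and closing steps are omitted.

```python
def rgb_to_cr(r, g, b):
    """Cr (red-difference chroma) of an 8-bit RGB pixel, full-range BT.601/JFIF."""
    return 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b


def segment_skin(image, cr_min=133, cr_max=173):
    """Return a binary mask (1 = skin candidate) for rows of (r, g, b) tuples.

    cr_min and cr_max are assumed bounds, not values taken from the paper.
    """
    return [
        [1 if cr_min <= rgb_to_cr(r, g, b) <= cr_max else 0 for (r, g, b) in row]
        for row in image
    ]


# Tiny 2x2 example image: two skin-like tones and two background pixels.
image = [
    [(224, 172, 105), (10, 10, 10)],
    [(255, 255, 255), (198, 134, 66)],
]
mask = segment_skin(image)
print(mask)
```

Neutral pixels (black, gray, white) have Cr close to 128 and fall outside the assumed range, which is why a single chroma threshold can separate a hand from a plain background.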

In addition to the training process, there is also a testing process. The testing process in
Figure 1(b) is as follows:
— Input: a test image database taken from a dataset with RGB layers.
— Pre-processing: resizing the image and segmenting the skin color of the input image, which
generates a contour image.
— Feature extraction with DWT.
— Class prediction with CNN: the forward pass is computed and the class with the highest
probability is predicted.

Figure 1. System flowchart, (a) Training, (b) Testing 


2.1. Discrete wavelet transform 

A wavelet is a wave-like oscillation whose amplitude starts at zero, increases or decreases, and
thus forms a fluctuating wave [13], as illustrated in Figure 2. A crucial feature of wavelets is
time-frequency localization [5], which allows wavelet analysis to vary its time-frequency aspect
ratio [14]. To compute the DWT, the input, interpreted as a signal, is decomposed using g(n)
(low-pass decomposition filter) and h(n) (high-pass decomposition filter) [15]. The next step is
down-sampling by two. The output is a low-frequency signal and a high-frequency signal. These two
processes are applied to the rows and then repeated for the columns, which produces four sub-bands
of output containing low- and high-frequency information [16]. Examples of wavelet applications
include data compression, voice signal recording, and music signal processing. Experimental results
showed that a wavelet transform-based approach was better than existing minutiae-based methods and
required less response time, which made it more suitable for online verification [5, 17].
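
The row-then-column filtering described above can be sketched for the Haar wavelet as follows. This is a dependency-free illustration under stated assumptions: image dimensions are even, and the sub-band naming (first letter for the row filter, second for the column filter) is one common convention that may differ from the one used in the paper's software.

```python
import math


def haar_1d(signal):
    """One Haar analysis step: low/high-pass filtering plus down-sampling by two."""
    s = 1.0 / math.sqrt(2.0)
    low = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    high = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return low, high


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def filter_columns(matrix):
    """Apply haar_1d to every column; return (low, high) as row-major matrices."""
    lows, highs = [], []
    for col in transpose(matrix):
        lo, hi = haar_1d(col)
        lows.append(lo)
        highs.append(hi)
    return transpose(lows), transpose(highs)


def haar_dwt2(image):
    """Single-level 2D Haar DWT returning the LL, LH, HL, and HH sub-bands."""
    row_low, row_high = [], []
    for row in image:  # step 1: filter and down-sample each row
        lo, hi = haar_1d(row)
        row_low.append(lo)
        row_high.append(hi)
    ll, lh = filter_columns(row_low)   # step 2: repeat along the columns
    hl, hh = filter_columns(row_high)
    return ll, lh, hl, hh


# Small binary image standing in for a segmented hand contour.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
    [0, 0, 0, 0],
]
ll, lh, hl, hh = haar_dwt2(image)
print(ll)
print(hl)
```

Each sub-band has half the rows and half the columns of the input. For a binary input the LL sub-band stays non-negative, while the detail sub-bands can contain negative coefficients.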


Figure 2. Wavelet for continuous wavelet transform (CWT) and DWT applications 


2.2. Convolutional neural network 

Deep learning is a set of algorithms that model high-level abstractions in data using processing
layers with complex structures, consisting of several non-linear transformations [18]. Deep
learning can be supervised, semi-supervised, or unsupervised [8, 16, 19]. One example of deep
learning is the deep neural network (DNN) [6]. This type of neural network is effective for
unlabeled or unstructured data [20]. There is also a much faster yet simpler learning algorithm
called the extreme learning machine (ELM), but its main problem is that it is sensitive to
the hidden neurons [21].

In deep learning, the convolutional neural network (CNN, or ConvNet) is a class of DNNs most
commonly applied to analyzing visual images. A CNN is a neural network that consists of
a combination of convolutional layers, pooling layers, and fully-connected layers [22].
One distinctive property of CNN is that not all neurons in one layer are connected to the next
layer, which makes computation and training faster [23]. The convolutional layer is unique to CNN;
it has a collection of filters that learn features from the input images, as illustrated in
Figure 3.

Figure 3. CNN architecture 

The convolutional layer processes data with a grid topology [7]. Through the convolutional layer,
features are extracted and passed to the next layer, in the hope that more complex features will
be extracted there [19]. The operation at the CNN convolutional layer is [1, 24]:


$$C[i, j] = (I * K)[i, j] = \sum_{m} \sum_{n} I[m, n] \, K[i - m, j - n] \qquad (1)$$
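
A direct, unoptimized reading of the convolution sum in (1) is sketched below, with I as the input image and K as the kernel. It computes the full output, of size (H + kh - 1) by (W + kw - 1), which is an assumption about boundary handling that the paper does not specify. Deep learning libraries commonly implement the layer as a cross-correlation over the valid region instead, which differs only by a kernel flip and cropping.

```python
def conv2d_full(image, kernel):
    """Full 2D convolution: C[i][j] = sum over m, n of I[m][n] * K[i - m][j - n]."""
    height, width = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = [[0.0] * (width + kw - 1) for _ in range(height + kh - 1)]
    for m in range(height):
        for n in range(width):
            for p in range(kh):      # p = i - m
                for q in range(kw):  # q = j - n
                    out[m + p][n + q] += image[m][n] * kernel[p][q]
    return out


image = [[1, 2],
         [3, 4]]
kernel = [[0, 1],
          [1, 0]]
feature_map = conv2d_full(image, kernel)
print(feature_map)
```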


The pooling layer is essentially a resizing process whose purpose is to reduce the number of
parameters and the calculation time needed when training a network [25]. The third layer in CNN
is the fully-connected layer, which takes all the neurons in the preceding layer (convolutional
layer and pooling layer) and connects them to every neuron in this layer [26]. Two activation
functions are used. The first is the rectified linear unit (ReLU) [27]:


$$f(x) = \max(0, x) \qquad (2)$$


where x is the input value to the neuron; this function is used only in the convolutional layer.
The second is the softmax function [10]:


$$f(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}} \quad \text{for } i = 1, \ldots, K \text{ and } z = (z_1, \ldots, z_K) \in \mathbb{R}^K \qquad (3)$$
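
The pooling and activation steps above can be sketched as follows. Two points are assumptions rather than details from the paper: the pooling is taken to be 2x2 max pooling with stride 2, and the score vector is a hypothetical output of the fully-connected layer for the five gesture classes.

```python
import math


def relu(x):
    """Rectified linear unit, f(x) = max(0, x)."""
    return max(0.0, x)


def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2 (assumed pooling type; even dimensions)."""
    return [
        [max(feature_map[i][j], feature_map[i][j + 1],
             feature_map[i + 1][j], feature_map[i + 1][j + 1])
         for j in range(0, len(feature_map[0]), 2)]
        for i in range(0, len(feature_map), 2)
    ]


def softmax(z):
    """Softmax over a score vector; the maximum is subtracted for stability."""
    peak = max(z)
    exps = [math.exp(v - peak) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


CLASSES = ["A", "B", "C", "pointing", "#5"]
scores = [1.0, 2.5, 0.3, -1.2, 0.0]  # hypothetical fully-connected outputs
probs = softmax(scores)
predicted = CLASSES[probs.index(max(probs))]  # class with the highest probability
print(predicted, probs)
```

Subtracting the maximum score before exponentiation does not change the result of the softmax but avoids overflow for large scores.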


After the forward pass, a further process adjusts the weight values so that the network output
matches the expected output. This process is called back-propagation [28]. The basic idea of
back-propagation is to use the chain rule to calculate the effect of each weight change on a cost
function and to minimize the error function by gradient descent, so that the output of the network
approaches the expected output [29]. The error function estimates the difference between
the output of the network and the expected output [23, 30].
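
The weight update can be illustrated with a single softmax layer. This is a deliberately reduced sketch and not the network used in the paper: it assumes a cross-entropy error function (the paper does not name its error function), and the input vector, target class, and layer size are hypothetical.

```python
import math


def softmax(z):
    peak = max(z)
    exps = [math.exp(v - peak) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def forward(weights, bias, x):
    """Forward pass of one fully-connected softmax layer."""
    z = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]
    return softmax(z)


def train_step(weights, bias, x, target, lr):
    """One gradient-descent step; returns the cross-entropy loss before the update."""
    probs = forward(weights, bias, x)
    loss = -math.log(probs[target])
    for k in range(len(weights)):
        # Chain rule: dLoss/dz_k = p_k - y_k for softmax with cross-entropy.
        delta = probs[k] - (1.0 if k == target else 0.0)
        for i in range(len(x)):
            weights[k][i] -= lr * delta * x[i]  # dLoss/dw_ki = delta * x_i
        bias[k] -= lr * delta
    return loss


weights = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]  # 3 classes, 2 input features
bias = [0.0, 0.0, 0.0]
x, target = [1.0, 2.0], 2                        # hypothetical training sample
losses = [train_step(weights, bias, x, target, lr=0.01) for _ in range(100)]
print(losses[0], losses[-1])
```

In a full CNN the same chain-rule step is repeated layer by layer, from the output back to the convolutional filters.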

Accuracy is calculated from the images classified by the CNN [31]:


$$\text{Accuracy (\%)} = \frac{N_c}{N} \times 100\% \qquad (4)$$


where N_c is the number of test samples classified correctly by the system and N is the total
number of test samples. Computational time is the time the system takes to perform the series of
processes.
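
As a worked instance of (4), the sketch below counts correct predictions over a list of labels. The label lists are hypothetical and serve only to show the arithmetic.

```python
def accuracy_percent(predicted, actual):
    """Accuracy (%) = N_c / N * 100, with N_c the number of correct predictions."""
    if not actual or len(predicted) != len(actual):
        raise ValueError("label lists must be non-empty and of equal length")
    n_correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return 100.0 * n_correct / len(actual)


# Hypothetical test run: 10 samples, 9 classified correctly.
actual = ["A", "A", "B", "B", "C", "C", "pointing", "pointing", "#5", "#5"]
predicted = ["A", "A", "B", "C", "C", "C", "pointing", "pointing", "#5", "#5"]
print(accuracy_percent(predicted, actual))
```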


3. RESULTS AND DISCUSSION 

Five classes are classified by the system: A, B, C, pointing, and #5. System testing was carried
out on these five classes of hand gestures using two datasets: dataset A, from the Sebastien Marcel
Static Hand Posture Database [11], and dataset B, the authors' dataset [12]. Table 1 shows the data
specification. The two datasets differ in resolution and quality. Dataset A had varied resolutions,
mostly 76x66 pixels, and low image quality. Dataset B, taken with a smartphone camera, had
a resolution of 4128x3096 pixels and higher image quality. Both datasets were resized during
pre-processing (76x66 pixels for dataset A and 128x128 pixels for dataset B).


Table 1. Data specification

Class       Amount of data used
            Dataset A    Dataset B
A           50           50
B           50           50
C           50           50
Pointing    50           50
#5          50           50
Total       250          250


3.1. Layer testing on system performance 

This test used an 80-20% split between training and testing data for both datasets. Seven layers
were tested to determine which was best for this system. For YCbCr and HSV, only the Cr and V
layers were used, respectively. Table 2 shows the effect of the input layer on system performance.
Based on Table 2, the difference in brightness and contrast between the two datasets affected
the accuracy. For dataset B, the binary layer was chosen because it gave the highest accuracy.
For dataset A, the accuracy was the same for the Cr (YCbCr), binary, grayscale, and V (HSV) layers,
so the binary layer was chosen for simplicity.


Table 2. Layer test results

Layer         Accuracy (%)              Computation time (s)
              Dataset A    Dataset B    Dataset A    Dataset B
Red           96           82           1.31         20
Green         98           76           1.40         21
Blue          96           72           1.40         20
YCbCr (Cr)    100          74           1.30         21
Binary        100          88           1.70         29
Grayscale     100          a2           1.39         20
HSV (V)       100          74           2.02         24


3.2. Sub-band testing on system performance 

In this section, testing uses an 80-20% split between training and testing data for both datasets.
The input image uses the binary layer. Four sub-bands at decomposition level 1, namely LL, LH, HL,
and HH, were tested to determine which wavelet sub-band is best for this system. Table 3 shows
the effect of the wavelet sub-band on system performance. Based on Table 3, the LL sub-band
produces the highest system accuracy in both datasets. Both datasets take the LL sub-band because
it produces finer image contours than the other sub-bands, as shown in Figure 4. The images in
the LH, HL, and HH sub-bands contain negative values, which cause image information to be lost.


Table 3. Sub-band test results

Sub-band    Accuracy (%)              Computation time (s)
            Dataset A    Dataset B    Dataset A    Dataset B
LL          100          92           1.3          25
LH          94           72           1.60         36
HL          96           712          1.50         19
HH          94           60           1.6          19





Figure 4. Display of sub-band test results using dataset A 


3.3. Level decomposition testing on system performance 

In this section, testing uses an 80-20% split between training and testing data for both datasets.
Input images use the binary layer and the LL sub-band. Four decomposition levels are tested to
determine which wavelet decomposition level is best for this system. Table 4 shows the effect of
the wavelet decomposition level on system performance. Based on Table 4, accuracy generally
decreases as the decomposition level increases. This is because the image resolution becomes
smaller, so fewer features are extracted. The smaller image size can cause information to be lost.
Figure 5 illustrates what happens when the decomposition level is raised: the greater the level,
the smaller the resolution of the image. For the sake of visualization, the images in the figure
are enlarged so that they can be seen.
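
The shrinking resolution can be made concrete with a short calculation. The sketch assumes that each level halves both dimensions and rounds up, which holds for the Haar wavelet with dyadic down-sampling; longer filters or other boundary modes would give slightly larger sub-bands.

```python
def subband_sizes(height, width, levels):
    """Sub-band size after each decomposition level (halving, rounded up)."""
    sizes = []
    for _ in range(levels):
        height, width = (height + 1) // 2, (width + 1) // 2
        sizes.append((height, width))
    return sizes


print(subband_sizes(76, 66, 4))    # dataset A input size
print(subband_sizes(128, 128, 4))  # dataset B input size
```

Under these assumptions a 76x66 input is reduced to 5x5 coefficients at level 4, which leaves very little shape information for the classifier.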


Table 4. Level decomposition test results

Decomposition level    Accuracy (%)              Computation time (s)
                       Dataset A    Dataset B    Dataset A    Dataset B
1                      100          92           1.4          25
2                      96           74           1.5          28
3                      86           78           1.2          35
4                      66           74           1.2          31







Figure 5. Displayed level decomposition for dataset B 


3.4. Testing the mother wavelet types on system performance 

In this section, testing uses an 80-20% split between training and testing data for both datasets.
Input images use the binary layer, the LL sub-band, and level 1 wavelet decomposition. Several
mother wavelet types are tested to determine which is best for this system. Table 5 shows
the effect of the mother wavelet type on system performance. Based on Table 5, the type of mother
wavelet makes a difference in how the characteristics of an object are captured. The Haar wavelet
represents texture and shape characteristics well and produces higher system accuracy than
the other wavelet types.


Table 5. Mother wavelet test results

Mother wavelet    Accuracy (%)              Computation time (s)
                  Dataset A    Dataset B    Dataset A    Dataset B
Haar              100          93           1.3          32
db3               100          82           1.6          19
db5               100          90           1.5          19
db7               98           84           1.4          19


3.5. Learning rate testing on system performance 

In this section, the CNN learning rate is tested in terms of the accuracy and computation time of
the system. Input images use the binary layer, the LL sub-band, level 1 decomposition, and the Haar
wavelet. Table 6 shows the effect of the learning rate on system performance. The learning rate
determines how much the weight values are adjusted with respect to the cost function value.
The lower the learning rate, the more precise the movement of the gradient toward the valley
(downward slope). However, reducing the learning rate increases the computational time needed.
Figure 6 shows that when the learning rate is 0.1, the training system has a very high loss at
the beginning of the iterations, which decreases the accuracy of the recognition system. When
the learning rate is 0.01, the training system has a lower loss at the beginning of the iterations,
so the system works more optimally.


Table 6. Learning rate test results

Learning rate    Accuracy (%)              Computation time (s)
                 Dataset A    Dataset B    Dataset A    Dataset B
0.1              88           78           1.3          19
0.01             100          80           1.4          19
0.001            98           90           1.1          28
0.0001           98           70           1.1          18





Figure 6. Loss graph, (a) Learning rate at 0.1, (b) Learning rate at 0.01 


3.6. Epoch testing on system performance 

In this section, the epoch parameter of the CNN classifier is tested in terms of the accuracy and
computational time of the system. The epoch values analyzed are 10, 50, 100, 150, and 200 for
dataset A and dataset B. Table 7 shows the effect of the epoch on system performance. Based on
Table 7, the result was influenced by image quality parameters such as brightness and contrast.
More epochs are needed to learn higher-quality images. However, a poorly chosen epoch value merely
lengthens the training time without necessarily increasing accuracy. For simplicity, the epoch
taken for dataset A was 100. Two datasets were used in this work for two reasons: first, to
investigate the capability of CNN to classify images at a smaller (76x66) or bigger (128x128)
pixel size; second, to check how CNN classifies images with low brightness and contrast.


Table 7. Epoch test results

Epoch    Accuracy (%)              Computation time (s)
         Dataset A    Dataset B    Dataset A    Dataset B
10       98           68           1.2          19
50       96           78           1.3          19
100      100          90           1.3          19
150      100          76           1.3          19
200      100          74           1.3          19


3.7. Testing data on system performance 

The goal of this test was to determine the best partition between training and testing data for
the DWT-based system. The partitions examined were 80%-20% (40 training and 10 test images),
70%-30% (35 training and 15 test images), and 50%-50% (25 training and 25 test images) of
the dataset for each type of gesture, as displayed in Table 8. The system requires a lot of
training data to work optimally. It can be seen that good-quality images (dataset B) are more
difficult to train and test because their parameter values vary more than those of lower-quality
images (dataset A).
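
The per-class partitions can be reproduced with a short sketch. The file names and the fixed random seed are hypothetical; the paper does not describe how images were assigned to the training and test sets.

```python
import random


def split_per_class(items_by_class, train_fraction, seed=0):
    """Split each class separately so every gesture keeps the same train/test ratio."""
    rng = random.Random(seed)
    train, test = [], []
    for label, items in items_by_class.items():
        shuffled = list(items)
        rng.shuffle(shuffled)
        n_train = round(train_fraction * len(shuffled))
        train += [(item, label) for item in shuffled[:n_train]]
        test += [(item, label) for item in shuffled[n_train:]]
    return train, test


# 50 hypothetical image names for each of the five gesture classes.
classes = ["A", "B", "C", "pointing", "#5"]
dataset = {label: [f"{label}_{i:02d}.png" for i in range(50)] for label in classes}

for fraction in (0.8, 0.7, 0.5):
    train, test = split_per_class(dataset, fraction)
    print(fraction, len(train), len(test))
```

With 50 images per class, the three fractions give 40/10, 35/15, and 25/25 images per gesture, matching the partitions listed above.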


Table 8. Data test results

Data partition    Accuracy (%)              Computation time (s)
                  Dataset A    Dataset B    Dataset A    Dataset B
80%-20%           100          90           1.25         19
70%-30%           98.67        78.67        1.25         19
50%-50%           96.80        83           1.25         19


4. CONCLUSION 

In this paper, a hand gesture recognition system using the discrete wavelet transform and
convolutional neural networks was proposed. The problem of classifying similar yet different
gestures was addressed by adopting the DWT and CNN classification system. For dataset A, the best
results were obtained using the binary layer, the LL sub-band, level-one decomposition, and
the Haar wavelet as DWT parameters, with a learning rate of 0.01 and 100 epochs as CNN parameters.
For dataset B, the best parameters were the binary layer, the LL sub-band, level-one decomposition,
and the Haar wavelet as DWT parameters, with a learning rate of 0.001 and 100 epochs as CNN
parameters. With these best parameters, the hand gesture recognition system with DWT and CNN
achieved an accuracy of 100% on dataset A and 90% on dataset B.